nmer_OriginalHomeworkCode_03

5 Challenges

  1. Received this error while trying to run a qqplot for years educated: Error in hist.default(a, breaks = seq(0, 9, 1), probability = TRUE, main = “Probability of Years Educated”, : some ‘x’ not counted; maybe ‘breaks’ do not span range of ‘x’. I realized before I could even google it that it was because I had not swapped out my data varialbe form the last qqplot! I switched it from a for age to ye for years educated and it ran.
  2. At first I was not sure exactly how to plot the histograms in question 4. I looked back in previous modules and realized it was very similar to simulating a distrubtion to rnorm then plotting those results in a histogram, but here we already had a distribution in the various quantitative variables.I used max() and min() for each variable to determine the x axis range and breaks.
  3. I could not get the green run code button to work on my last two chunks for my histograms/qqplots in question 4. I tried knitting it to see if I could see anything in the rendered version that showed any problems or if ran the code there. I noticed an erroneous ``` in the text immediately preceding the problematic chunks. I went back to the source and deleted it. When I did that, I was able to run the two chunks.
  4. Got this error when trying to do sample(d, 30, replace=FALSE): “Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when ‘replace = FALSE’”. I googled it and it seems like using a dplyr function (slice_sample()) is easier than trying to make sample() work in this context. I was able to “slice” 30 rows of the original 1000 in one line of code. I also decided to use set.seed() so that the data would be the same every time it is run.
  5. I originally switched qnorm for qpois to do the CI but the upper and lower turned out the same values. I realized I needed to calculate the Standard Errors differently for those. However, that did not fix it. I searched online and found a lot of places saying to get an exact poisson CI, you need to use chi squared with alpha values. I found one website which had a formula for this. I used that formula but left my original method in the chunks in case I want to workshop it more. Here is where I found the formula: https://stats.stackexchange.com/questions/10926/how-to-calculate-confidence-interval-for-count-data-in-r Bonus 6th challenge: See Question 6 for graveyard of abandoned code trying to sample the data 99 times. I continued trying to use slice_sample(d) with the replicate function which replicated the data set 99 times with vectors of length 30 in each cell. I could not subset variables from it and take means, even with for loops because things were not the correct class or data type. Don’t write code late at night! I talked about it with Jimmy and we agreed the general idea was right but instead needed to go back to regular old sample() and subset the data to each variable within sample. Then use mean around sample, then replicate around the whole thing with 99. Smooth sailing after that.
file <- curl('https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/zombies.csv')
d <- read.csv(file, header= TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
##   id first_name last_name gender   height   weight zombies_killed
## 1  1      Sarah    Little Female 62.88951 132.0872              2
## 2  2       Mark    Duncan   Male 67.80277 146.3753              5
## 3  3    Brandon     Perez   Male 72.12908 152.9370              1
## 4  4      Roger   Coleman   Male 66.78484 129.7418              5
## 5  5      Tammy    Powell Female 64.71832 132.4265              4
## 6  6    Anthony     Green   Male 71.24326 152.5246              1
##   years_of_education                           major      age
## 1                  1                medicine/nursing 17.64275
## 2                  3 criminal justice administration 22.58951
## 3                  1                       education 21.91276
## 4                  6                  energy studies 18.19058
## 5                  3                       logistics 21.10399
## 6                  4                  energy studies 21.48355
str(d)
## 'data.frame':    1000 obs. of  10 variables:
##  $ id                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ first_name        : chr  "Sarah" "Mark" "Brandon" "Roger" ...
##  $ last_name         : chr  "Little" "Duncan" "Perez" "Coleman" ...
##  $ gender            : chr  "Female" "Male" "Male" "Male" ...
##  $ height            : num  62.9 67.8 72.1 66.8 64.7 ...
##  $ weight            : num  132 146 153 130 132 ...
##  $ zombies_killed    : int  2 5 1 5 4 1 0 4 9 2 ...
##  $ years_of_education: int  1 3 1 6 3 4 4 0 3 3 ...
##  $ major             : chr  "medicine/nursing" "criminal justice administration" "education" "energy studies" ...
##  $ age               : num  17.6 22.6 21.9 18.2 21.1 ...

Question 1

Calculate the population mean and standard deviation for each quantitative random variable (height, weight, age, number of zombies killed, and years of education). NOTE: You will not want to use the built in var() and sd() commands as these are for samples.

h <- d$height
mh <- mean(h)
mh
## [1] 67.6301
w <- d$weight
mw <- mean(w)
mw
## [1] 143.9075
a <- d$age
ma <- mean(a)
ma
## [1] 20.04696
zk <- d$zombies_killed
mzk <- mean(zk)
mzk
## [1] 2.992
ye <- d$years_of_education
mye <- mean(ye)
mye
## [1] 2.996

For population standard deviation, first will define a function for population variance, not sample variance. This is the sum of squares/n.

pop_v <- function(x) {
    sum((x - mean(x))^2)/length(x)
}
pop_v(h)
## [1] 18.55861
pop_v(w)
## [1] 338.2604
pop_v(a)
## [1] 8.782822
pop_v(zk)
## [1] 3.053936
pop_v(ye)
## [1] 2.807984

To get the population SD, I just need to define a function taking the square root of the population variance

pop_sd <- function(x) {
    sqrt(pop_v(x))
}
pop_sd(h)
## [1] 4.30797
pop_sd(w)
## [1] 18.39186
pop_sd(a)
## [1] 2.963583
pop_sd(zk)
## [1] 1.747551
pop_sd(ye)
## [1] 1.675704

Question 2

Use {ggplot} to make boxplots of each of these variables by gender.

ph <- ggplot(data = d, aes(x = gender, y = h))
ph <- ph + geom_boxplot()
ph <- ph + theme(axis.text.x = element_text(angle = 90))
ph <- ph + ylab("Heights of Pop")
ph

pw <- ggplot(data = d, aes(x = gender, y = w))
pw <- pw + geom_boxplot()
pw <- pw + theme(axis.text.x = element_text(angle = 90))
pw <- pw + ylab("Weights of Pop")
pw

pa <- ggplot(data = d, aes(x = gender, y = a))
pa <- pa + geom_boxplot()
pa <- pa + theme(axis.text.x = element_text(angle = 90))
pa <- pa + ylab("Ages of Pop")
pa

pzk <- ggplot(data = d, aes(x = gender, y = zk))
pzk <- pzk + geom_boxplot()
pzk <- pzk + theme(axis.text.x = element_text(angle = 90))
pzk <- pzk + ylab("Number of Zombies Killed")
pzk

pye <- ggplot(data = d, aes(x = gender, y = ye))
pye <- pye + geom_boxplot()
pye <- pye + theme(axis.text.x = element_text(angle = 90))
pye <- pye + ylab("Years Educated")
pye

Question 3

Use {ggplot} to make scatterplots of height and weight in relation to age. Do these variables seem to be related? In what way?

ph1 <- ggplot(data = d, aes(x = a, y = h, color = factor(gender)))
ph1 <- ph1 + xlab("Age") + ylab("Height")
ph1 <- ph1 + geom_point()
ph1 <- ph1 + theme(legend.position = "bottom", legend.title = element_blank())
ph1

pw1 <- ggplot(data = d, aes(x = a, y = w, color = factor(gender)))
pw1 <- pw1 + xlab("Age") + ylab("Weight")
pw1 <- pw1 + geom_point()
pw1 <- pw1 + theme(legend.position = "bottom", legend.title = element_blank())
pw1

Looks like there is a relationship between age and hieght- height is dependent on age. As age goes up, height also goes up. However, there is no such relationship between age and weight.

Question 4

Using histograms and Q-Q plots, check whether the quantitative variables seem to be drawn from a normal distribution. Which seem to be and which do not (hint: not all are drawn from the normal distribution)? For those that are not normal, can you determine from which common distribution they are drawn?

par(mfrow = c(1,2))
min(h)
## [1] 54.14948
max(h)
## [1] 80.5298
hist(h, breaks = seq(50,90,1), probability= TRUE, main = "Probability of Height of Pop", xlab = "Height", ylab = "probability")
qqnorm(h, main = "Normal QQ Plot for Height of Pop")
qqline(h, col = "gray")

par(mfrow = c(1,2))
min(w)
## [1] 90.29148
max(w)
## [1] 210.7895
hist(w, breaks = seq(80,220,1), probability= TRUE, main = "Probability of Weight of Pop", xlab = "Weight", ylab = "probability")
qqnorm(w, main = "Normal QQ Plot Weight of Pop")
qqline(w, col = "gray")

par(mfrow = c(1,2))
min(a)
## [1] 10.66381
max(a)
## [1] 29.59488
hist(a, breaks = seq(9,30,1), probability= TRUE, main = "Probability of Age of Pop", xlab = "Age", ylab = "probability")
qqnorm(a, main = "Normal QQ Plot Age of Pop")
qqline(a, col = "gray")

par(mfrow = c(1,2))
min(zk)
## [1] 0
max(zk)
## [1] 11
hist(zk, breaks = seq(-1,12,1), probability= TRUE, main = "Probability of Zombie Kills", xlab = " of Zombie Kills", ylab = "probability")
qqnorm(zk, main = "Normal QQ Plot of Zombie Kills")
qqline(zk, col = "gray")

This is a poisson distribution which makes sense because it is a similar scenario to the titi calls. How many kills are observed?

par(mfrow = c(1,2))
min(ye)
## [1] 0
max(ye)
## [1] 8
hist(ye, breaks = seq(0,9,1), probability= TRUE, main = "Probability of Years Educated", xlab = "Years", ylab = "probability")
qqnorm(ye, main = "Normal QQ Plot Years Educated")
qqline(ye, col = "gray")

This is also a poisson distribution, for similar reasons as above.

Question 5

Now use the sample() function to sample ONE subset of 30 zombie survivors (without replacement) from this population and calculate the mean and sample standard deviation for each variable. Also estimate the standard error for each variable, and construct the 95% confidence interval for each mean. Note that for the variables that are not drawn from the normal distribution, you may need to base your estimate of the CIs on slightly different code than for the normal…

set.seed(1)
thirtyz <- slice_sample(d, n = 30, replace = FALSE) #this is an easier dplyr function
h <- thirtyz$height
mh <- mean(h)
mh
## [1] 66.88522
w <- thirtyz$weight
mw <- mean(w)
mw
## [1] 142.2536
a <- thirtyz$age
ma <- mean(a)
ma
## [1] 19.68657
zk <- thirtyz$zombies_killed
mzk <- mean(zk)
mzk
## [1] 3.066667
ye <- thirtyz$years_of_education
mye <- mean(ye)
mye
## [1] 2.966667

Can use built in function this time because it is a sample

sd(h)
## [1] 3.735037
sd(w)
## [1] 16.22658
sd(a)
## [1] 2.960476
sd(zk)
## [1] 1.680175
sd(ye)
## [1] 1.691425

The standard error can be calculated many different ways. Can divide each sd by 30, can create my own function, or I can use the sciplot built in function. I will load in sciplot and use that function. Standard error for the poisson I will do separetly since it just sqrt(lambda/n)

se(h)
## [1] 0.6819214
se(w)
## [1] 2.962555
se(a)
## [1] 0.5405065
seZK <- sqrt(mzk/length(zk))
seZK
## [1] 0.3197221
seYE <- sqrt(mye/length(ye))
seYE
## [1] 0.314466

This sample is relatively close to the population. Most variables are at less than one standard error. Weight is almost 3 standard errors away, however.

upperH <- mh + qnorm(0.975, mean = 0, sd = 1) * se(h)
lowerH <- mh + qnorm(0.025, mean = 0, sd = 1) * se(h)
ciH <- c(lowerH, upperH)
ciH
## [1] 65.54868 68.22176
upperW <- mw + qnorm(0.975, mean = 0, sd = 1) * se(w)
lowerW <- mw + qnorm(0.025, mean = 0, sd = 1) * se(w)
ciW <- c(lowerW, upperW)
ciW
## [1] 136.4471 148.0601
upperA <- ma + qnorm(0.975, mean = 0, sd = 1) * se(a)
lowerA <- ma + qnorm(0.025, mean = 0, sd = 1) * se(a)
ciA <- c(lowerA, upperA)
ciA
## [1] 18.62719 20.74594
upperZK <- mzk + qpois(0.975, lambda = 0) * seZK
lowerZK <- mzk + qpois(0.025, lambda = 0) * seZK
ciZK <- c(lowerZK, upperZK)
ciZK
## [1] 3.066667 3.066667
#this obviously doesn't work since it returns the same thing for both upper and lower
exactPoiCI <- function (X, conf.level=0.95) {
  alpha = 1 - conf.level
  upper <- 0.5 * qchisq((1-(alpha/2)), (2*X)) #chi square necessary here
  lower <- 0.5 * qchisq(alpha/2, (2*X +2))
  return(c(lower, upper))
}
exactPoiCI(mzk, conf.level = 0.95) #0.95 is default but good to show what we are asking for
## [1] 1.123747 7.330303
upperYE <- mean(ye) + qpois(0.975, lambda = 0) * seYE
lowerYE <- mean(ye) + qpois(0.025, lambda = 0) * seYE
ciYE <- c(lowerYE, upperYE)
ciYE
## [1] 2.966667 2.966667
exactPoiCI(mye)
## [1] 1.073027 7.171705

Question 6

Now draw 99 more random samples of 30 zombie apocalypse survivors, and calculate the mean for each variable for each of these samples. Together with the first sample you drew, you now have a set of 100 means for each variable (each based on 30 observations), which constitutes a sampling distribution for each variable. What are the means and standard deviations of this distribution of means for each variable? How do the standard deviations of means compare to the standard errors estimated in [5]? What do these sampling distributions look like (a graph might help here)? Are they normally distributed? What about for those variables that you concluded were not originally drawn from a normal distribution?

Here’s what I tried first and it created a massive matrix with every cell containing a vector:

x <- replicate(99, slice_sample(d, n = 30, replace = FALSE)) #I refused to use a for loop here x <- t(x) #transposed so that the column names are the variables as.data.frame(x)

I blame slice_sample now!

Graveyard of attempted code to get means from my overly complicated matrix:

mHeight <- lapply(x[1:99, ], mean(x[, 5])) mWeight <- mean(x[6, 1:99]) mZombies_Killed <- mean(x[7, 1:99]) mYears_Ed <- mean(x[8, 1:99]) mAge <- mean(x[10, 1:99]) # heights <- x[1:99, 5] mHeights <- NULL for (i in heights) { mHeights[i] <- mean(x$height) }

Instead I am going to sample each variable independently (and that way I shouldn’t need to use slice_sample either) and take the means directly in that line of code.

set.seed(2)
mHeight <- replicate(99, mean(sample(d$height, 30, replace = F))) #taking a sample of heights, averaging, then repeating 99 more times
mWeight <- replicate(99, mean(sample(d$weight, 30, replace = F)))
mAge <- replicate(99, mean(sample(d$age, 30, replace = F)))
mZombies_Killed <- replicate(99, mean(sample(d$zombies_killed, 30, replace = F)))
mYears_Ed <- replicate(99, mean(sample(d$years_of_education, 30, replace = F)))

Combine the first sample with the new ones:

mHeights <- c(mh, mHeight)
mWeights <- c(mw, mWeight)
mAges <- c(ma, mAge)
mKills <- c(mzk, mZombies_Killed)
mEducation <- c(mye, mYears_Ed)

Means of Means, and SDs of Means

mean(mHeights)
## [1] 67.653
mean(mWeights)
## [1] 143.4503
mean(mAges)
## [1] 20.1015
mean(mKills)
## [1] 3.033333
mean(mEducation)
## [1] 3.005333
sd(mHeights)
## [1] 0.8021361
sd(mWeights)
## [1] 3.178964
sd(mAges)
## [1] 0.5241518
sd(mKills)
## [1] 0.309556
sd(mEducation)
## [1] 0.2536256
# Standard errors from question 5, the one sampling distribution
se(h)
## [1] 0.6819214
se(w)
## [1] 2.962555
se(a)
## [1] 0.5405065
seZK
## [1] 0.3197221
seYE
## [1] 0.314466

Except for weights, there standard deviation values are pretty small suggesting that there is little variance. The standard errors from the single sampling in question 5 were pretty small as well (except for weights again) which suggests that they were fairly representative of the population. The standard deviations are pretty close to each of their respective standard errors which is indicative of one of the rules of the central limit theorem - that the standard deviation will be nearly equal to the standard error of the mean.

Below I have plotted each sampling distribution, with two lines: - the black line is the mean of the sampling distributions -the red is the mean of the population For the central limit theorem, the disitribution should be approximately centered on the population mean. I also included qqplots to show that they are normally distributed. I opted not go do par() because it made the plots too small to read.

hist(mHeights, probability = T)
abline(v = mean(mHeights))
abline(v= mean(d$height), col = "red")

qqnorm(mHeights, main = "Normal QQ Plot for Sampled Mean Heights")
qqline(mHeights, col = "gray")

hist(mWeights, probability = T)
abline(v = mean(mWeights))
abline(v= mean(d$weight), col = "red")

qqnorm(mWeights, main = "Normal QQ Plot for Sampled Mean Weights")
qqline(mWeights, col = "gray")

hist(mAges, probability = T)
abline(v = mean(mAges))
abline(v= mean(d$age), col = "red")

qqnorm(mAges, main = "Normal QQ Plot for Sampled Mean Ages")
qqline(mAges, col = "gray")

hist(mKills, probability = T)
abline(v = mean(mKills))
abline(v= mean(d$zombies_killed), col = "red")

qqnorm(mKills, main = "Normal QQ Plot for Sampled Mean Zombie Kills")
qqline(mKills, col = "gray")

hist(mEducation, probability = T)
abline(v = mean(mEducation))
abline(v= mean(d$years_of_education), col = "red")

qqnorm(mEducation, main = "Normal QQ Plot for Sampled Mean Years Educated")
qqline(mEducation, col = "gray")

They are all normally distributed, including years educated and zombies killed which were poisson distributed before. This demonstrates the central limit theorem!